System for effective use of data for personalization

ABSTRACT

Off-policy evaluation of a new “target” policy is performed using historical data gathered based on a previous “logging” policy to estimate the performance of the target policy. An estimator may be used, wherein either a quality-based estimator or a quality-agnostic estimator is used to weight the difference between an observed reward in the historical data and an estimated reward generated by the target policy. A quality-agnostic estimator may be used to evaluate an importance weight according to a threshold. In such examples, when the importance weight exceeds the threshold, the quality-agnostic estimator clips the importance weight at the threshold, thereby providing a fixed upper bound irrespective of the quality of the reward predictor. In other examples, a quality-based estimator is used, in which an upper bound incorporates the quality of the reward predictor in order to modify an importance weight used by the estimator.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation application of U.S. patent application Ser. No. 16/657,533, filed on Oct. 18, 2019, which claims priority to U.S. Provisional Application No. 62/861,843, titled “A System for Effective Use of Data for Personalization,” filed on Jun. 14, 2019, the entire disclosures of all of which are hereby incorporated by reference.

BACKGROUND

In applications using the contextual bandit protocol, a logging policy is used to take action based on a given context, thereby earning a reward. In some instances, a logging policy is evaluated based on an average reward metric. However, generating an updated or new policy (a “target policy”) that achieves a similar or improved average reward metric is difficult without costly A/B testing and large datasets.

It is with respect to these and other general considerations that the aspects disclosed herein have been made. Also, although relatively specific problems may be discussed, it should be understood that the examples should not be limited to solving the specific problems identified in the background or elsewhere in this disclosure.

SUMMARY

This disclosure describes systems and methods for evaluating a policy and generating a policy with improved performance. In some examples, off-policy evaluation is performed using historical data gathered based on a previous algorithm (e.g., a “logging policy”) in order to estimate the performance of an updated algorithm (e.g., a “target policy”). An estimator may be used, wherein an importance weight is used to weight the difference between an observed reward in the historical data and an estimated reward generated by the target policy (e.g., as may be estimated by a reward predictor). In examples, the approach involves reducing an importance weight to improve a bound on the mean squared error (MSE). In some examples, a quality-agnostic estimator is used to evaluate an importance weight according to a threshold. In such examples, when the importance weight exceeds the threshold, the threshold is used as the importance weight, thereby providing an upper bound that is irrespective of the quality of the reward predictor. In other examples, a quality-based estimator is used, in which an upper bound incorporates the quality of the reward predictor in order to modify an importance weight used by the estimator.

It will be appreciated that, in another example, the off-policy estimates can be used to generate a new target policy. In examples, an estimator that improves a bound on the MSE may be selected and used to generate the target policy accordingly. Aspects described herein for evaluating policies and for generating an optimized policy are adaptive to different applications and contexts. Such an approach may result in evaluation and optimization that is more accurate given the same amount of historical data from a previous logging policy. Therefore, less data may be necessary to perform adequate evaluation and optimization of policies, thereby reducing difficulties associated with collecting data and potential issues arising from outdated or stale data.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Additional aspects, features, and/or advantages of examples will be set forth in part in the description which follows and, in part, will be apparent from the description, or may be learned by practice of the disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

Non-limiting and non-exhaustive examples are described with reference to the following figures.

FIG. 1 illustrates an overview of an example system for improved personalization techniques according to aspects described herein.

FIG. 2A illustrates an overview of an example method for evaluating a target policy based on historical data that was generated according to a logging policy.

FIG. 2B illustrates an overview of an example method for determining a hyperparameter for off-policy evaluation.

FIG. 2C illustrates an overview of an example method for selecting an estimator class for off-policy evaluation.

FIG. 3 is a block diagram illustrating example physical components of a computing device with which aspects of the disclosure may be practiced.

FIGS. 4A and 4B are simplified block diagrams of a mobile computing device with which aspects of the present disclosure may be practiced.

FIG. 5 is a simplified block diagram of a distributed computing system in which aspects of the present disclosure may be practiced.

FIG. 6 illustrates a tablet computing device for executing one or more aspects of the present disclosure.

DETAILED DESCRIPTION

Various aspects of the disclosure are described more fully below with reference to the accompanying drawings, which form a part hereof, and which show specific exemplary aspects. However, different aspects of the disclosure may be implemented in many different forms and should not be construed as limited to the aspects set forth herein; rather, these aspects are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the aspects to those skilled in the art. Aspects may be practiced as methods, systems, or devices. The following detailed description is, therefore, not to be taken in a limiting sense.

In examples of a contextual bandit protocol, a policy evaluates a context, determines an action based at least in part on the context, and accrues a reward. As used herein, a policy (or “decision maker”) may be a pre-existing “logging” policy or a new “target” policy. In examples, a policy is generated with the goal of increasing or maximizing the probability of accruing a reward for a given context (e.g., based on the determined action). As an example, determining an action may comprise selecting a content item from a set of content items for presentation to a user. Accordingly, if the user engages with the content item, a reward is earned. However, if the user does not engage with the content item, no reward is earned. Thus, a reward may be represented as ‘1’ if the user engages with the content item or ‘0’ if the user does not. Thus, in such an example, a policy that increases or maximizes the likelihood of accruing a reward entails presenting content that is likely to result in user engagement.

In examples, a context space from which a context is observed may be uncountably large. In other examples, it may be assumed that an action space from which an action is determined is finite. As used herein, context may relate to user activity on a user device, such as clicking on links, opening emails, sending emails, opening applications, interactions within applications, a location or time at which an application is used, and other user activity performed on the user device. Accordingly, example actions include, but are not limited to, presenting a content item (e.g., a link to a website, textual content, graphical content, video content, audio content, target content, etc.), an application or an action within an application on the user device, or a contact, among other actions (e.g., ranking a set of content items). If user interaction results from the determined action (e.g., a user clicks on a link, calls a contact, etc.), a reward is incurred. As another example, an initial reward may be incurred if the user interacts with the presented content, and a subsequent reward may be incurred if the user initiates and/or completes a purchase associated with the presented content. It will be appreciated that while examples are generally described herein with a singular action, similar techniques may be used to determine a set of actions (e.g., presenting multiple content items, multiple applications or actions within an application, multiple contacts, etc.). In such examples, an associated reward is incurred if any one action of the set of actions results in a user interaction.

A target policy may be evaluated using off-policy evaluation. In examples, the goal of off-policy evaluation is to use historical data gathered by a past policy (e.g., a logging policy) in order to estimate the performance of the new target policy. As an example, historical data may include, but is not limited to, a set of triples, wherein each triple comprises a context, an associated action, and a resulting reward. It will be appreciated that historical information may include additional, less, or alternative information relating to a user interaction associated with the logging policy. Returning to the above example, off-policy evaluation of a target policy may be performed based on historical data relating to user interactions with content identified according to a logging policy. Thus, off-policy evaluation of the target policy comprises determining a predicted likelihood that the target policy will identify content with which a user will engage, based on historical interactions resulting from the logging policy. Accordingly, if the target policy appears to have a higher likelihood of user engagement (and therefore a higher predicted reward) than the logging policy, the target policy may be implemented in place of the logging policy.
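
To make the setup concrete, the following sketch estimates a target policy's average reward from such logged triples using inverse propensity scoring, a building block of the doubly robust approach discussed below. This is a minimal illustration rather than the claimed method: the record fields, the function names, and the assumption that each record carries the logging policy's action probability (its propensity) are all illustrative.

def ips_estimate(records, target_prob):
    """Inverse propensity scoring over logged (context, action, reward) data.

    records: iterable of dicts with keys "context", "action", "reward", and
             "propensity" (the logging policy's probability of "action").
    target_prob(context, action): probability that the target policy would
             choose "action" in "context".
    """
    total = 0.0
    count = 0
    for rec in records:
        # Reweight each logged reward by how much more (or less) likely the
        # target policy is to take the logged action than the logging policy.
        w = target_prob(rec["context"], rec["action"]) / rec["propensity"]
        total += w * rec["reward"]
        count += 1
    return total / count if count else 0.0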

High quality off-policy estimates may avoid costly A/B testing and may also be used to generate an improved or optimized policy. As used herein, an “optimized” policy refers to a policy that exhibits one or more improved characteristics as compared to a previous policy. For example, a target policy may exhibit a reduced mean squared error or an increased average reward as compared to a logging policy, among other examples. It will be appreciated that while examples herein are described with respect to a user interacting with selected content, any of a variety of contexts, actions, and rewards may be used according to the present disclosure.

A challenge in off-policy evaluation is distribution mismatch, where actions determined by a target policy for a given context may differ from historical actions determined by the logging policy that was in use when the historical data was collected. “Doubly robust estimation” is an example estimator that may be used to address such challenges. In examples, doubly robust estimation uses a combination of inverse propensity scoring and direct modeling, where inverse propensity scoring may be used to correct a distribution mismatch by reweighting the data, while direct modeling may be used to reduce the impact of large importance weights. As an example, direct modeling comprises generating and using a regression model to predict rewards. In other examples, a reward predictor may be trained and used to generate a predicted reward for a given context.

Doubly robust estimation may yield results that are less biased or unbiased, and that have a smaller variance than those achieved using inverse propensity scoring. Further, doubly robust estimation may be asymptotically optimal under weaker assumptions than direct modeling. However, since doubly robust estimation may use the same importance weights as inverse propensity scoring, its variance can still be high, unless the reward predictor is highly accurate. Accordingly, in some examples, doubly robust estimation may be further improved according to aspects described herein by clipping or removing large importance weights, such as by using either a quality-agnostic estimator or a quality-based estimator to weight reward predictions used by a doubly robust estimation model. While weight clipping or shrinking may incur a small bias, it may also substantially decrease the variance, which can result in a lower mean squared error (MSE) than implementing doubly robust estimation without such techniques. This disclosure presents systems and methods that improve off-policy evaluation through weight clipping.
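
As a rough illustration, the sketch below combines the two components for a deterministic target policy: the reward predictor supplies a baseline, and an importance-weighted residual on the logged action corrects the predictor's error. The record layout, the function names, and the optional shrink hook (used with the shrinkage estimators below) are assumptions for the example, not the patent's implementation.

def dr_estimate(records, target_action, reward_predictor, shrink=None):
    """Doubly robust estimate of a deterministic target policy's value.

    target_action(context): the action the target policy takes in "context".
    reward_predictor(context, action): estimated reward (the direct model).
    shrink: optional function mapping an importance weight to a clipped or
            shrunk weight (see the shrinkage estimators below).
    """
    total = 0.0
    for rec in records:
        x, a, r = rec["context"], rec["action"], rec["reward"]
        pi_a = target_action(x)
        # Importance weight: 1/propensity when the policies agree, else 0.
        w = (1.0 if pi_a == a else 0.0) / rec["propensity"]
        if shrink is not None:
            w = shrink(w)
        # Direct-model baseline plus the importance-weighted correction.
        total += reward_predictor(x, pi_a) + w * (r - reward_predictor(x, a))
    return total / len(records)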

Aspects described herein allow for better evaluation and optimization of policies. For example, less historical data associated with a logging policy may be required to evaluate a target policy, thereby reducing computational overhead associated with acquiring and processing the historical data. Additionally, as a result of efficiently using less historical data for off-policy evaluation, the complexities of obtaining relevant and current historical data are reduced, thereby minimizing the impact of potentially stale data on the evaluation. Further, as noted above, costly A/B testing can be avoided, among other benefits.

Off-policy evaluation techniques described herein may also be performed using an associated hyperparameter. In examples, the approach presented herein involves shrinking the importance weights to optimize a bound on the MSE of the estimator. Two classes of estimators are described herein. The first estimator class is not tied to the quality of the reward predictor and is referred to herein as a “quality-agnostic” estimator or doubly robust estimation with “pessimistic shrinkage.” The quality-agnostic estimator uses a threshold to evaluate an importance weight. In examples, when the importance weight exceeds the threshold, the threshold is used in place of the importance weight, thereby clipping importance weights that are above the threshold. However, if the importance weight does not exceed the threshold, the importance weight is not changed. As an example, a quality-agnostic estimator ŵ_(p,λ)(x,a) may be modeled by the equation below:

ŵ_(p,λ)(x,a) = min{λ, w(x,a)}

In the above equation, p indicates that the estimator is quality-agnostic (e.g., it does not account for the quality of the reward predictor), λ represents a value of the hyperparameter (e.g., the threshold at which weights are clipped), x represents a context for a contextual bandit policy, a represents the action associated with the context x (e.g., as may have been generated by a logging policy), w(x,a) represents the importance weight, and ŵ_(p,λ)(x,a) represents the new importance weight according to the quality-agnostic estimator. It will be appreciated that the above equation is provided as an example equation with which to implement doubly robust estimation with pessimistic shrinkage.
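
A direct transcription of this clipping rule, usable as the shrink hook in the earlier sketch; the function name and the example threshold are illustrative.

def shrink_pessimistic(w, lam):
    """Quality-agnostic (pessimistic) shrinkage: clip the importance
    weight w at the threshold lam, per the equation above."""
    return min(lam, w)

# Example use with the earlier sketch (the threshold 10.0 is arbitrary):
# dr_estimate(records, target_action, reward_predictor,
#             shrink=lambda w: shrink_pessimistic(w, 10.0))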

In other aspects, a second class of estimator is used, which is referred to herein as a quality-based estimator or doubly robust estimation with “optimistic shrinkage.” Unlike the quality-agnostic estimator, the quality-based estimator uses an upper bound that is based on the quality of the reward predictor. For example, an importance weight generated according to a quality-based estimator may be modified or generated in a way that incorporates the original importance weight. In some examples, the approach herein may bound the bias and variance in terms of the weighted square loss. As an example, a quality-based estimator ŵ_(o,λ)(x,a) may be modeled by the equation below:

ŵ_(o,λ)(x,a) = (λ / (w²(x,a) + λ)) · w(x,a)

In the above equation, o indicates that the estimator is quality-based (e.g., it relies on the quality of the reward predictor), λ represents a value of the hyperparameter, x represents a context for a contextual bandit policy, a represents the action associated with the context x (e.g., as may have been generated by a logging policy), w(x,a) represents the importance weight, and ŵ_(o,λ)(x,a) represents the new importance weight according to the quality-based estimator. It will be appreciated that the above equation is provided as an example equation with which to implement doubly robust estimation with optimistic shrinkage.
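
The corresponding shrink hook for this equation, again with illustrative naming. Note that the rule leaves small weights nearly unchanged (ŵ ≈ w when w²(x,a) is much smaller than λ) while decaying very large weights toward zero.

def shrink_optimistic(w, lam):
    """Quality-based (optimistic) shrinkage: lam / (w^2 + lam) * w,
    per the equation above."""
    return lam / (w * w + lam) * w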

As discussed above, each class of estimator may include a hyperparameter in order to reduce or, in some examples, eliminate bias and variance from the estimator. In some examples, a model selection procedure is used to tune the hyperparameter (e.g., λ in the above example equations) and/or determine which estimator class to use (e.g., a quality-agnostic estimator, a quality-based estimator, etc.). As an example, the model selection process comprises selecting a value for the hyperparameter and determining which estimator to use, such that the resulting off-policy evaluation model yields an MSE below a target threshold or, in some examples, within a predetermined range. Accordingly, the techniques described herein achieve a similar or improved result as compared to previous solutions. Additionally, such off-policy evaluation techniques may exhibit improved finite-sample performance and may therefore achieve results that are comparable to other techniques using comparatively less historical data.
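
One way such a selection procedure might be organized is as a grid search over (estimator class, λ) pairs that keeps the pair with the lowest estimated MSE. The sketch below assumes an estimate_mse helper that scores a candidate configuration (e.g., via a bias/variance surrogate or held-out data); that helper, like the λ grid, is an assumption rather than part of the patent.

def select_model(records, target_action, reward_predictor, estimate_mse,
                 lambdas=(1.0, 10.0, 100.0)):
    """Return (mse, class_name, lam, shrink) minimizing the estimated MSE."""
    best = None
    for lam in lambdas:
        for name, rule in (("quality-agnostic", shrink_pessimistic),
                           ("quality-based", shrink_optimistic)):
            # Bind lam and rule so each candidate gets its own shrink hook.
            shrink = lambda w, lam=lam, rule=rule: rule(w, lam)
            mse = estimate_mse(records, target_action,
                               reward_predictor, shrink)
            if best is None or mse < best[0]:
                best = (mse, name, lam, shrink)
    return best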

Similar techniques may be used in instances in which a policy is used to determine a list of actions rather than a single action. As an example, the overall reward for the list (which may vary according to the actions selected and their positions) may be estimated by decomposing it into individual contributions of each action. The contributions of all possible actions may be estimated by applying a weight matrix to the vector representation of the list of actions selected by the logging policy. The resulting contribution estimates can then be combined to evaluate the reward for the list of actions chosen by the target policy. An example equation r̂(x,S) for use in such scenarios is provided below:

r̂(x,S) = r̃(x,S) + Σ_(j=1)^(l) ϕ̂_x[j, s_(j)]

In the above equation, x is a context, S is a list of actions (s₁, . . . , s_(l)), r̃(x,S) represents a reward predictor, and ϕ̂_x[j, s_(j)] represents the contribution of action s_(j) in slot j from the list of actions S, such that the summation of ϕ̂_x yields the overall contribution according to each action in the list relative to what is already captured by r̃. It will be appreciated that the above equation is provided as an example and that, in other examples, different techniques may be used to apply aspects described herein to evaluate a list of actions.
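
Given a table of estimated per-slot contributions, applying the decomposition to score a list is mechanical, as the sketch below shows. How ϕ̂ is itself estimated (e.g., via the weight matrix mentioned above) is outside this illustration, and all of the names are assumptions.

def slate_estimate(x, S, predict_list_reward, phi_hat):
    """Score the list of actions S for context x per the equation above.

    predict_list_reward(x, S): the baseline list-level reward predictor.
    phi_hat[j][s]: estimated contribution of action s placed in slot j.
    """
    total = predict_list_reward(x, S)
    for j, s_j in enumerate(S):
        total += phi_hat[j][s_j]  # add each chosen action's contribution
    return total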

In addition to evaluating a target policy, aspects described herein may be used to generate an improved or optimized policy. In examples, the improved policy is generated based on an analysis of historical data gathered according to a logging policy. As an example, an adaptive decision algorithm is used to determine actions for given contexts. The adaptive decision algorithm is tuned according to context and associated actions from the historical data. In other examples, the decision algorithm may make decisions that are randomized, at least in part, when determining an action, thereby increasing the likelihood of exploring different possible actions. In another example, an exploration budget is used, wherein the performance of the adaptive decision algorithm is evaluated (e.g., according to the one or more estimator classes and associated techniques described herein) as compared to a default policy (e.g., a logging policy, according to a fixed decision algorithm based on the historical data, etc.). In some examples, if the performance of the adaptive decision algorithm exceeds the exploration budget, the default policy may be used to determine actions instead until the performance is back within the exploration budget, thereby limiting the potential impact of performance degradation.

FIG. 1 illustrates an overview of an example system 100 for improved personalization techniques according to aspects described herein. As illustrated, system 100 comprises server device 102 and user device 104. In examples, server device 102 and user device 104 communicate using a network, such as a local area network, a wireless network, or the Internet, or any combination thereof. In an example, user device 104 is any of a variety of computing devices, including, but not limited to, a mobile computing device, a laptop computing device, a tablet computing device, or a desktop computing device. In other examples, server device 102 is a computing device, including, but not limited to, a desktop computing device or a distributed computing device. It will be appreciated that while system 100 is illustrated as comprising one server device 102 and one user device 104, any number of devices may be used in other examples.

User device 104 is illustrated as comprising client application 106 and contextual data store 108. Client application 106 may be a web browser or a messaging application, among other example applications. In examples, client application 106 communicates with server device 102 to access content to display to a user of user device 104. As the user engages with client application 106, user interactions may be stored in contextual data store 108. Thus, contextual data store 108 may store any of a variety of context information, including, but not limited to, accessed links, opened emails, sent emails, opened applications, interactions within applications, a location or time at which an application is used, and other activities on user device 104.

Server device 102 is illustrated as comprising action generation engine 110, policy evaluation engine 112, and historical data store 114. In examples, action generation engine 110 uses a policy (e.g., a logging policy) to generate an action according to a given context. For example, user device 104 may provide context information from contextual data store 108, which is used by action generation engine 110 to determine an action. An indication of the action may be provided to user device 104. User device 104 may then generate a display according to the action (e.g., presenting a content item to a user, displaying a ranked list of content items, suggesting content or an application, etc.). Accordingly, depending on the outcome of the selected action, action generation engine 110 may receive an indication of an associated reward. In another example, action generation engine 110 may use a context determined at server device 102 (e.g., associated with a user account, with a specific cookie, with a session, etc.) instead of or in addition to context received from user device 104. In examples, action generation engine 110 logs historical data and stores such information in historical data store 114. As described above, historical data may be stored in the form of triples, comprising a context, a determined action, and an associated reward. It will be appreciated that, in other examples, additional, less, or alternative information may be stored as historical data in historical data store 114.

Server device 102 further comprises policy evaluation engine 112. Policy evaluation engine 112 implements aspects described herein to perform off-policy evaluation of a new policy (e.g., a target policy) according to historical data generated by an existing logging policy. In examples, policy evaluation engine 112 accesses historical data from historical data store 114 (e.g., as may have been generated by action generation engine 110). In examples, policy evaluation engine 112 performs a model selection process to determine whether to apply an off-policy evaluation model with a quality-agnostic estimator or a quality-based estimator in its analysis of the target policy in view of the historical data. Additionally, policy evaluation engine 112 may determine the value of an optional hyperparameter in order to further tune the model that is used to evaluate the target policy. Ultimately, the target policy is evaluated according to the determined off-policy evaluation model in order to compare the performance of the target policy to the logging policy that is currently in use by action generation engine 110. In examples, the comparison comprises evaluating an average reward metric, wherein the average reward incurred by each policy (e.g., for a set of contexts) is compared to determine which policy incurs the highest average reward. If the target policy exhibits a higher average reward metric, the target policy may be used by action generation engine 110 in place of the logging policy when generating subsequent actions according to given contexts. It will be appreciated that any of a variety of other metrics may be used to compare a target policy and a logging policy, including, but not limited to, average variance or a total reward value. Additional example aspects of policy evaluation engine 112 are discussed in greater detail below with respect to FIGS. 2A-2C.

While example implementations are described above with respect to a server computing device and/or a user device, it will be appreciated that any of a variety of other devices may be used to implement the aspects described herein. Similarly, while certain operations are described with respect to either server device 102 or user device 104, it will be appreciated that aspects described herein may be split among and performed by any of a wide array of computing device configurations. For example, aspects of action generation engine 110 may be performed by user device 104 or, in another example, at least a subpart of contextual data store 108 may reside on server device 102.

FIG. 2A illustrates an overview of an example method 200 for evaluating a target policy based on historical data that was generated according to a logging policy. Method 200 may be performed by one or more computing devices, including, but not limited to, a personal computer, a laptop computer, a tablet computer, a mobile computing device, or a distributed computing device. As an example, aspects of method 200 may be performed by server device 102 and/or user device 104 in FIG. 1. As another example, aspects of method 200 may be performed by policy evaluation engine 112 in FIG. 1. Method 200 begins at operation 202, where historical data associated with a logging policy may be accessed. For example, the data may comprise information relating to a contextual bandit protocol, including a context, an action, a reward, and/or the performance of the protocol, according to aspects disclosed herein. It will be appreciated that data may be accessed from any of a variety of sources, including, but not limited to, a local data store (e.g., contextual data store 108 in FIG. 1) or a remote data store (e.g., historical data store 114), or any combination thereof.

At operation 204, a reward predictor is generated. As described above, a reward predictor may generate an expected reward. For example, the reward predictor generates an expected reward using the historical data associated with the logging policy, as was accessed at operation 202. Thus, in examples, the reward predictor generates the predicted reward given the context and the associated action that was determined by the logging policy based on the context. In some examples, the reward predictor may be modeled as a regression function (e.g., as may be the case when using a direct modeling approach).

It will be appreciated that, in other examples, multiple reward predictors may be generated at operation 204. For example, each reward predictor may use a different kind of regression function. As an example, a first reward predictor may be generated using linear regression, whereas a second reward predictor may be generated according to a deep neural network. Other examples include using different weighting techniques when generating the multiple reward predictors. Assuming a data weighting function z(x,a), where x is a context and a is an action, example weighting techniques include, but are not limited to, a uniform weighting function (e.g., z(x,a)=1), a weighting function based on importance weights (e.g., assuming weights are defined by a function w(·), z(x,a)=w(x,a)), or a weighting function based on the square of importance weights (e.g., z(x,a)=w²(x,a)). Another example weighting function that may be used for policy optimization is:

z(x,a) = 1/μ(a|x)

In the above example equation, μ(·) is a function describing the probability of picking action a given a context x while applying a given logging policy. It will be appreciated that the above equations are provided as examples and that any of a variety of other functions may be used according to aspects described herein.
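
As an illustration of how such a weighting function might enter the reward-predictor fit, the sketch below performs weighted least squares with a chosen z(x,a). The featurize feature map and the record layout are assumptions for the example.

import numpy as np

def fit_reward_predictor(records, featurize, z):
    """Fit a linear reward predictor by weighted least squares.

    featurize(x, a): feature vector for a context/action pair.
    z(x, a): data weighting function, e.g. z(x, a) = 1 for uniform weights.
    """
    X = np.array([featurize(r["context"], r["action"]) for r in records])
    y = np.array([float(r["reward"]) for r in records])
    sw = np.sqrt([z(r["context"], r["action"]) for r in records])
    # Weighted least squares via rescaling rows of X and entries of y.
    theta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return lambda x, a: float(np.asarray(featurize(x, a)) @ theta)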

At operation 206, the model selection process is performed using the generated reward predictor(s). As described herein, a set of one or more estimators may be used, such as a quality-agnostic estimator and/or a quality-based estimator. In some examples, multiple quality-agnostic estimators are evaluated at operation 206, where each quality-agnostic estimator uses a different hyperparameter. Similarly, multiple quality-based estimators may be evaluated at operation 206, where each quality-based estimator uses a different hyperparameter. Indeed, as discussed above, an estimator may include one or more hyperparameters that are used to clip or shrink the importance weights of the reward predictor according to aspects described herein. The model selection process further comprises evaluating each estimator to determine which estimator yields an off-policy evaluation model with the smallest error. As an example, the set of estimators may be compared according to the least squared error associated with each estimator. It will be appreciated that other techniques may be used to compare each of the models and ultimately select a model with which to evaluate the target policy. As another example, models may be evaluated according to which models exhibit an error below a certain threshold or within a certain range. Additional example aspects of model selection are described in more detail below with respect to FIGS. 2B and 2C. In examples, the model selection may identify a policy that minimizes MSE, bias, and variance. The model selection may also identify which class of estimator to use.

Flow progresses to operation 208, where the target policy is evaluated according to the model selected at operation 206. In examples, the evaluation comprises using the selected model to generate an average reward metric for the target policy for a set of contexts from the historical data accessed at operation 202. It will be appreciated that any of a variety of other metrics may be used to evaluate a target policy, including, but not limited to, average variance or a total reward value.

At determination 210, it is determined whether the target policy is expected to perform better than the logging policy with which the historical data was generated. In examples, the determination comprises comparing the average reward metric for the target policy (e.g., as generated at operation 208) to the average reward incurred by the logging policy. If the target policy is better than the logging policy (e.g., it exhibits a higher average reward metric than the logging policy), flow branches YES to operation 212, where the target policy is used in place of the logging policy. For example, action generation engine 110 in FIG. 1 may receive an indication to use the target policy in place of the logging policy. Flow terminates at operation 212. If, however, it is determined that the target policy is not better than the logging policy, flow instead branches NO to operation 214, where the logging policy continues to be used instead of the target policy. Flow terminates at operation 214.

FIG. 2B illustrates an overview of an example method 220 for determining a hyperparameter for off-policy evaluation. Method 220 may be performed by one or more computing devices, including, but not limited to, a personal computer, a laptop computer, a tablet computer, a mobile computing device, or a distributed computing device. As an example, aspects of method 220 may be performed by server device 102 and/or user device 104 in FIG. 1. As another example, aspects of method 220 may be performed by policy evaluation engine 112 in FIG. 1. Aspects of method 220 may be performed as part of the model selection process at operation 206 in FIG. 2A.

Method 220 begins at operation 222, where a hyperparameter is selected. In examples, the hyperparameter is selected according to one or more previous off-policy evaluations. In another example, the hyperparameter may be selected based on a hyperparameter that was used to evaluate a logging policy with which historical data was collected. In other examples, the hyperparameter may be iteratively selected, where the hyperparameter is either iteratively increased or decreased to achieve an error below a threshold or within a certain range. It will be appreciated that a variety of other techniques may be used to select the hyperparameter.

At operation 224, historical data is evaluated according to a model based on the selected hyperparameter. In examples, the evaluation comprises using the hyperparameter with respect to one or more estimator classes, such as a quality-based estimator and/or a quality-agnostic estimator. The historical data may be accessed from a historical data store, such as historical data store 114 in FIG. 1. The evaluation may comprise generating an MSE for the model according to the hyperparameter. It will be appreciated that, in other examples, different metrics may be used to evaluate the model.

Flow progresses to determination 226, where it is determined whether the MSE is below a certain threshold. The threshold may be a threshold preconfigured by a user, or may be programmatically determined (e.g., based on evaluating the data according to a doubly robust estimator without application of either a quality-based or quality-agnostic estimator). It will be appreciated that, in other examples, a range of values may be used to determine whether the MSE is acceptable. If it is determined that the MSE is not below the threshold, flow branches NO and returns to operation 222. As noted above, the hyperparameter selection process may be iterative, such that an updated hyperparameter is determined at operation 222. Accordingly, flow loops between operations 222, 224, and 226 until the MSE is below a threshold. In some examples, determination 226 further comprises a counter such that flow instead branches YES to operation 228 after a certain number of iterations. Ultimately, flow arrives at operation 228, which is discussed below.
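
The control flow of operations 222 through 226 might be sketched as follows, including the iteration counter that forces termination. The doubling update rule and the helper names are illustrative assumptions.

def tune_lambda(eval_mse, threshold, lam=1.0, max_iters=20):
    """Iterate operations 222-226: adjust lam until the estimated MSE is
    below the threshold, or stop after max_iters (the counter)."""
    for _ in range(max_iters):
        if eval_mse(lam) < threshold:  # determination 226
            return lam                 # proceed to operation 228
        lam *= 2.0                     # operation 222: pick an updated value
    return lam                         # counter exhausted; fall through to 228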

If, however, it is determined that the MSE is below the threshold, flow instead branches YES to operation 228, where the hyperparameter is used for off-policy evaluation according to aspects described herein. For example, performing the off-policy evaluation according to the determined hyperparameter may comprise performing the steps of method 200 in FIG. 2A, as described above. Flow terminates at operation 228.

FIG. 2C illustrates an overview of an example method 240 for selecting an estimator class for off-policy evaluation. Method 240 may be performed by one or more computing devices, including, but not limited to, a personal computer, a laptop computer, a tablet computer, a mobile computing device, or a distributed computing device. As an example, aspects of method 240 may be performed by server device 102 and/or user device 104 in FIG. 1. As another example, aspects of method 240 may be performed by policy evaluation engine 112 in FIG. 1. Aspects of method 240 may be performed as part of the model selection process at operation 206 in FIG. 2A.

Method 240 begins at operation 242, where historical data is evaluated according to a quality-agnostic estimator. In examples, the quality-agnostic estimator clips weights that are placed on reward predictions according to aspects described herein. Operation 242 may comprise generating an MSE associated with the evaluation of the historical data using the quality-agnostic estimator.

Flow progresses to operation 244, where the historical data is evaluated according to a quality-based estimator. As discussed above, the quality-based estimator weights reward predictions according to the quality of the reward predictor. Similar to operation 242, operation 244 may further comprise generating an MSE for the evaluation of the historical data using the quality-based estimator. It will be appreciated that, in other examples, a different metric may be used, such that both operation 242 and operation 244 generate a different metric for comparison.

Moving to determination 246, it is determined whether the quality-based estimator yields better results than the quality-agnostic estimator. Method 240 is an example in which MSE is used to evaluate the two estimators. Accordingly, the determination comprises evaluating the MSE for each estimator to determine which estimator exhibits the lower MSE. If it is determined that the quality-based estimator yields a lower MSE, flow branches to operation 248, where the quality-based estimator is used to perform off-policy evaluation of a target policy (e.g., as discussed above with respect to method 200 in FIG. 2A). If, however, it is determined that the quality-agnostic estimator exhibits a lower MSE than the quality-based estimator, flow instead branches to operation 250, where the quality-agnostic estimator is used to perform off-policy evaluation of the target policy. Flow terminates at operation 248 or 250.

FIGS. 3-6 and the associated descriptions provide a discussion of a variety of operating environments in which aspects of the disclosure may be practiced. However, the devices and systems illustrated and discussed with respect to FIGS. 3-6 are for purposes of example and illustration and are not limiting of a vast number of computing device configurations that may be utilized for practicing aspects of the disclosure, described herein.

FIG. 3 is a block diagram illustrating physical components (e.g., hardware) of a computing device 300 with which aspects of the disclosure may be practiced. The computing device components described below may be suitable computing devices for implementing aspects of the present disclosure described above. In a basic configuration, the computing device 300 may include at least one processing unit 302 and a system memory 304. Depending on the configuration and type of computing device, the system memory 304 may comprise, but is not limited to, volatile storage (e.g., random access memory), non-volatile storage (e.g., read-only memory), flash memory, or any combination of such memories.

The system memory 304 may include an operating system 305 and one or more program modules 306 suitable for running software application 320, such as one or more components supported by the systems described herein. As examples, system memory 304 may store client application 324 and policy generator 326. For example, client application 324 may display content determined according to an action of a logging policy. A user may interact with such content, thereby incurring a reward. Such interactions may form a part of historical user interactions. Policy generator 326 may implement aspects of method 200 in order to optimize the logging policy and/or generate a new target policy according to aspects described herein. The operating system 305, for example, may be suitable for controlling the operation of the computing device 300.

Furthermore, embodiments of the disclosure may be practiced in conjunction with a graphics library, other operating systems, or any other application program and is not limited to any particular application or system. This basic configuration is illustrated in FIG. 3 by those components within a dashed line 308. The computing device 300 may have additional features or functionality. For example, the computing device 300 may also include additional data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Such additional storage is illustrated in FIG. 3 by a removable storage device 309 and a non-removable storage device 310.

As stated above, a number of program modules and data files may be stored in the system memory 304. While executing on the processing unit 302, the program modules 306 (e.g., application 320) may perform processes including, but not limited to, the aspects, as described herein. Other program modules that may be used in accordance with aspects of the present disclosure may include electronic mail and contacts applications, word processing applications, spreadsheet applications, database applications, slide presentation applications, drawing or computer-aided application programs, etc.

Furthermore, embodiments of the disclosure may be practiced in an electrical circuit comprising discrete electronic elements, packaged or integrated electronic chips containing logic gates, a circuit utilizing a microprocessor, or on a single chip containing electronic elements or microprocessors. For example, embodiments of the disclosure may be practiced via a system-on-a-chip (SOC) where each or many of the components illustrated in FIG. 3 may be integrated onto a single integrated circuit. Such an SOC device may include one or more processing units, graphics units, communications units, system virtualization units and various application functionality all of which are integrated (or “burned”) onto the chip substrate as a single integrated circuit. When operating via an SOC, the functionality, described herein, with respect to the capability of client to switch protocols may be operated via application-specific logic integrated with other components of the computing device 300 on the single integrated circuit (chip). Embodiments of the disclosure may also be practiced using other technologies capable of performing logical operations such as, for example, AND, OR, and NOT, including but not limited to mechanical, optical, fluidic, and quantum technologies. In addition, embodiments of the disclosure may be practiced within a general purpose computer or in any other circuits or systems.

The computing device 300 may also have one or more input device(s) 312 such as a keyboard, a mouse, a pen, a sound or voice input device, a touch or swipe input device, etc. The output device(s) 314 such as a display, speakers, a printer, etc. may also be included. The aforementioned devices are examples and others may be used. The computing device 300 may include one or more communication connections 316 allowing communications with other computing devices 350. Examples of suitable communication connections 316 include, but are not limited to, radio frequency (RF) transmitter, receiver, and/or transceiver circuitry; universal serial bus (USB), parallel, and/or serial ports.

The term computer readable media as used herein may include computer storage media. Computer storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, or program modules. The system memory 304, the removable storage device 309, and the non-removable storage device 310 are all computer storage media examples (e.g., memory storage). Computer storage media may include RAM, ROM, electrically erasable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other article of manufacture which can be used to store information and which can be accessed by the computing device 300. Any such computer storage media may be part of the computing device 300. Computer storage media does not include a carrier wave or other propagated or modulated data signal.

Communication media may be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media. The term “modulated data signal” may describe a signal that has one or more characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media may include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared, and other wireless media.

FIGS. 4A and 4B illustrate a mobile computing device 400, for example, a mobile telephone, a smart phone, wearable computer (such as a smart watch), a tablet computer, a laptop computer, and the like, with which embodiments of the disclosure may be practiced. In some aspects, the client may be a mobile computing device. With reference to FIG. 4A, one aspect of a mobile computing device 400 for implementing the aspects is illustrated. In a basic configuration, the mobile computing device 400 is a handheld computer having both input elements and output elements. The mobile computing device 400 typically includes a display 405 and one or more input buttons 410 that allow the user to enter information into the mobile computing device 400. The display 405 of the mobile computing device 400 may also function as an input device (e.g., a touch screen display).

If included, an optional side input element 415 allows further user input. The side input element 415 may be a rotary switch, a button, or any other type of manual input element. In alternative aspects, mobile computing device 400 may incorporate more or fewer input elements. For example, the display 405 may not be a touch screen in some embodiments.

In yet another alternative embodiment, the mobile computing device 400 is a portable phone system, such as a cellular phone. The mobile computing device 400 may also include an optional keypad 435. Optional keypad 435 may be a physical keypad or a “soft” keypad generated on the touch screen display.

In various embodiments, the output elements include the display 405 for showing a graphical user interface (GUI), a visual indicator 420 (e.g., a light emitting diode), and/or an audio transducer 425 (e.g., a speaker). In some aspects, the mobile computing device 400 incorporates a vibration transducer for providing the user with tactile feedback. In yet another aspect, the mobile computing device 400 incorporates input and/or output ports, such as an audio input (e.g., a microphone jack), an audio output (e.g., a headphone jack), and a video output (e.g., an HDMI port) for sending signals to or receiving signals from an external device.

FIG. 4B is a block diagram illustrating the architecture of one aspect of a mobile computing device. That is, the mobile computing device 400 can incorporate a system (e.g., an architecture) 402 to implement some aspects. In one embodiment, the system 402 is implemented as a “smart phone” capable of running one or more applications (e.g., browser, e-mail, calendaring, contact managers, messaging clients, games, and media clients/players). In some aspects, the system 402 is integrated as a computing device, such as an integrated personal digital assistant (PDA) and wireless phone.

One or more application programs 466 may be loaded into the memory 462 and run on or in association with the operating system 464. Examples of the application programs include phone dialer programs, e-mail programs, personal information management (PIM) programs, word processing programs, spreadsheet programs, Internet browser programs, messaging programs, and so forth. The system 402 also includes a non-volatile storage area 468 within the memory 462. The non-volatile storage area 468 may be used to store persistent information that should not be lost if the system 402 is powered down. The application programs 466 may use and store information in the non-volatile storage area 468, such as e-mail or other messages used by an e-mail application, and the like. A synchronization application (not shown) also resides on the system 402 and is programmed to interact with a corresponding synchronization application resident on a host computer to keep the information stored in the non-volatile storage area 468 synchronized with corresponding information stored at the host computer. As should be appreciated, other applications may be loaded into the memory 462 and run on the mobile computing device 400 described herein (e.g., search engine, extractor module, relevancy ranking module, answer scoring module, etc.).

The system 402 has a power supply 470, which may be implemented as one or more batteries. The power supply 470 might further include an external power source, such as an AC adapter or a powered docking cradle that supplements or recharges the batteries.

The system 402 may also include a radio interface layer 472 that performs the function of transmitting and receiving radio frequency communications. The radio interface layer 472 facilitates wireless connectivity between the system 402 and the “outside world,” via a communications carrier or service provider. Transmissions to and from the radio interface layer 472 are conducted under control of the operating system 464. In other words, communications received by the radio interface layer 472 may be disseminated to the application programs 466 via the operating system 464, and vice versa.

The visual indicator 420 may be used to provide visual notifications, and/or an audio interface 474 may be used for producing audible notifications via the audio transducer 425. In the illustrated embodiment, the visual indicator 420 is a light emitting diode (LED) and the audio transducer 425 is a speaker. These devices may be directly coupled to the power supply 470 so that when activated, they remain on for a duration dictated by the notification mechanism even though the processor 460 and other components might shut down for conserving battery power. The LED may be programmed to remain on indefinitely until the user takes action to indicate the powered-on status of the device. The audio interface 474 is used to provide audible signals to and receive audible signals from the user. For example, in addition to being coupled to the audio transducer 425, the audio interface 474 may also be coupled to a microphone to receive audible input, such as to facilitate a telephone conversation. In accordance with embodiments of the present disclosure, the microphone may also serve as an audio sensor to facilitate control of notifications, as will be described below. The system 402 may further include a video interface 476 that enables an operation of an on-board camera 430 to record still images, video stream, and the like.

A mobile computing device 400 implementing the system 402 may have additional features or functionality. For example, the mobile computing device 400 may also include additional data storage devices (removable and/or non-removable) such as, magnetic disks, optical disks, or tape. Such additional storage is illustrated in FIG. 4B by the non-volatile storage area 468.

Data/information generated or captured by the mobile computing device 400 and stored via the system 402 may be stored locally on the mobile computing device 400, as described above, or the data may be stored on any number of storage media that may be accessed by the device via the radio interface layer 472 or via a wired connection between the mobile computing device 400 and a separate computing device associated with the mobile computing device 400, for example, a server computer in a distributed computing network, such as the Internet. As should be appreciated, such data/information may be accessed via the mobile computing device 400 via the radio interface layer 472 or via a distributed computing network. Similarly, such data/information may be readily transferred between computing devices for storage and use according to well-known data/information transfer and storage means, including electronic mail and collaborative data/information sharing systems.

FIG. 5 illustrates one aspect of the architecture of a system for processing data received at a computing system from a remote source, such as a personal computer 504, tablet computing device 506, or mobile computing device 508, as described above. Content displayed at server device 502 may be stored in different communication channels or other storage types. For example, various documents may be stored using a directory service 522, a web portal 524, a mailbox service 526, an instant messaging store 528, or a social networking site 530.

A client application 520 may be employed by a client that communicates with server device 502, and/or the policy generator 521 may be employed by server device 502. The server device 502 may provide data to and from a client computing device such as a personal computer 504, a tablet computing device 506 and/or a mobile computing device 508 (e.g., a smart phone) through a network 515. By way of example, the computer system described above may be embodied in a personal computer 504, a tablet computing device 506 and/or a mobile computing device 508 (e.g., a smart phone). Any of these embodiments of the computing devices may obtain content from the store 516, in addition to receiving graphical data useable to be either pre-processed at a graphic-originating system, or post-processed at a receiving computing system.

FIG. 6 illustrates an exemplary tablet computing device 600 that may execute one or more aspects disclosed herein. In addition, the aspects and functionalities described herein may operate over distributed systems (e.g., cloud-based computing systems), where application functionality, memory, data storage and retrieval and various processing functions may be operated remotely from each other over a distributed computing network, such as the Internet or an intranet. User interfaces and information of various types may be displayed via on-board computing device displays or via remote display units associated with one or more computing devices. For example, user interfaces and information of various types may be displayed and interacted with on a wall surface onto which user interfaces and information of various types are projected. Interaction with the multitude of computing systems with which embodiments of the invention may be practiced includes keystroke entry, touch screen entry, voice or other audio entry, gesture entry where an associated computing device is equipped with detection (e.g., camera) functionality for capturing and interpreting user gestures for controlling the functionality of the computing device, and the like.

As will be understood from the foregoing disclosure, one aspect of the technology relates to a system comprising at least one processor; and memory storing instructions that, when executed by the at least one processor, cause the system to perform a set of operations. The set of operations comprises: generating a reward predictor for historical data associated with a logging policy; determining an off-policy evaluation model, wherein the off-policy evaluation model comprises an estimator selected from the group consisting of a quality-agnostic estimator and a quality-based estimator; evaluating, using the off-policy evaluation model, a target policy to determine whether an expected reward metric of the target policy is higher than a reward metric of the logging policy; and when it is determined that the expected reward metric is higher than the reward metric of the logging policy, generating an indication to use the target policy instead of the logging policy. In an example, determining the off-policy evaluation model comprises: generating, for the quality-agnostic estimator, a first mean squared error (MSE) metric; generating, for the quality-based estimator, a second MSE metric; when the first MSE is lower than the second MSE, selecting the quality-agnostic estimator as the estimator; and when the second MSE is lower than the first MSE, selecting the quality-based estimator as the estimator. In another example, the off-policy evaluation model comprises a combination of direct modeling of a reward predictor and inverse propensity scoring, and a weight of the reward predictor in the off-policy evaluation model is determined according to the estimator. In a further example, determining the off-policy evaluation model comprises determining a hyperparameter for the estimator. In yet another example, the set of operations further comprises: receiving, from a user device, a second indication of a context; determining, according to the target policy, an action based on the received context; and providing, in response to the first indication, a third indication of the determined action. In a further still example, the quality-agnostic estimator comprises a threshold at which an importance weight is clipped if the weight exceeds the threshold. In an example, the set of operations further comprises: accessing the historical data from a historical data store, wherein the historical data comprises at least one context, an action associated with the context, and a reward for the action.

In another aspect, the technology relates to a method for selecting a new policy based on a previous policy. The method comprises: accessing historical data associated with the previous policy, the historical data comprising at least one context, an action determined based on the context, and a reward for the action; evaluating, using an off-policy evaluation model, the new policy to determine whether to use the new policy instead of the previous policy, wherein the off-policy evaluation model comprises a combination of a direct model, inverse propensity scoring, and an estimator selected from the group consisting of a quality-agnostic estimator and a quality-based estimator; and based on determining that the new policy should be used instead of the previous policy: generating an action for a context according to the new policy; and providing an indication of the action to a user device. In an example, the new policy is determined to be used instead of the previous policy when an average reward metric for the new policy is higher than an average reward metric for the previous policy, and the average reward metric for the new policy is determined using the off-policy evaluation model. In another example, the estimator of the off-policy evaluation model is selected by: generating, for the quality-agnostic estimator, a first mean squared error (MSE) metric; generating, for the quality-based estimator, a second MSE metric; when the first MSE is lower than the second MSE, selecting the quality-agnostic estimator as the estimator; and when the second MSE is lower than the first MSE, selecting the quality-based estimator as the estimator. In a further example, the method further comprises: determining, for the selected estimator, a hyperparameter for the estimator, wherein the hyperparameter is determined by iteratively refining the hyperparameter to reduce the MSE of the selected estimator. In yet another example, the quality-agnostic estimator comprises a threshold at which an importance weight is clipped if the weight exceeds the threshold. In a further still example, the direct model is a regression model for the historical data, and the inverse propensity scoring generates a weight for a predicted reward.
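
As a minimal sketch of the iterative hyperparameter refinement described above, a clipping threshold for the quality-agnostic estimator may be chosen by scoring candidate values with an estimated MSE and keeping the minimizer. The MSE surrogate below (a squared bias bound plus empirical variance, assuming rewards and predictions bounded in [0, 1]) and all names are assumptions for illustration:

    import numpy as np

    def estimated_mse(raw_weights, tau, rewards, r_hat_logged, r_hat_target):
        # Crude MSE surrogate: squared bias bound plus empirical variance.
        # The bias bound assumes |reward - prediction| <= 1.
        w = np.minimum(raw_weights, tau)
        terms = r_hat_target + w * (rewards - r_hat_logged)
        variance = np.var(terms, ddof=1) / len(terms)
        bias_bound = np.mean(np.maximum(raw_weights - tau, 0.0))
        return bias_bound ** 2 + variance

    def refine_threshold(rewards, target_probs, logging_probs,
                         r_hat_logged, r_hat_target,
                         candidates=np.logspace(0, 3, 20)):
        # Score each candidate threshold and keep the MSE-minimizing one.
        raw = target_probs / logging_probs
        scores = [estimated_mse(raw, tau, rewards, r_hat_logged, r_hat_target)
                  for tau in candidates]
        return candidates[int(np.argmin(scores))]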

In another aspect, the technology relates to another method for off-policy evaluation of a target policy. The method comprises: generating a reward predictor for historical data associated with a logging policy; determining an off-policy evaluation model, wherein the off-policy evaluation model comprises an estimator selected from the group consisting of a quality-agnostic estimator and a quality-based estimator; evaluating, using the off-policy evaluation model, the target policy to determine whether an expected reward metric of the target policy is higher than a reward metric of the logging policy; and when it is determined that the expected reward metric is higher than the reward metric of the logging policy, generating an indication to use the target policy instead of the logging policy. In an example, determining the off-policy evaluation model comprises: generating, for the quality-agnostic estimator, a first mean squared error (MSE) metric; generating, for the quality-based estimator, a second MSE metric; when the first MSE is lower than the second MSE, selecting the quality-agnostic estimator as the estimator; and when the second MSE is lower than the first MSE, selecting the quality-based estimator as the estimator. In another example, the off-policy evaluation model comprises a combination of direct modeling of a reward predictor and inverse propensity scoring, and a weight of the reward predictor in the off-policy evaluation model is determined according to the estimator. In a further example, determining the off-policy evaluation model comprises generating a hyperparameter for the estimator. In yet another example, the method further comprises: receiving, from a user device, a second indication of a context; determining, according to the target policy, an action based on the received context; and providing, in response to the second indication, a third indication of the determined action. In a further still example, the quality-agnostic estimator comprises a threshold at which an importance weight is clipped if the weight exceeds the threshold. In an example, the method further comprises: accessing the historical data from a historical data store, wherein the historical data comprises at least one context, an action associated with the context, and a reward for the action.
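
A hypothetical end-to-end flow corresponding to the above might resemble the following sketch, in which the target policy is adopted only when its expected reward metric exceeds that of the logging policy, after which contexts received from a user device are answered with actions from the policy in use (the policy interfaces and stand-in policies are illustrative assumptions):

    def choose_policy(expected_target_reward, logging_reward,
                      target_policy, logging_policy):
        # Switch to the target policy only when its expected reward metric,
        # estimated off-policy, exceeds the logging policy's reward metric.
        if expected_target_reward > logging_reward:
            return target_policy
        return logging_policy

    def handle_request(context, policy):
        # Determine an action for the received context and return an
        # indication of the determined action to the user device.
        return {"action": policy(context)}

    # Example usage with trivial stand-in policies:
    policy = choose_policy(0.42, 0.37,
                           target_policy=lambda ctx: max(ctx),
                           logging_policy=lambda ctx: min(ctx))
    print(handle_request([1, 3, 2], policy))  # {'action': 3}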

Aspects of the present disclosure, for example, are described above with reference to block diagrams and/or operational illustrations of methods, systems, and computer program products according to aspects of the disclosure. The functions/acts noted in the blocks may occur out of the order shown in any flowchart. For example, two blocks shown in succession may in fact be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved.

The description and illustration of one or more aspects provided in this application are not intended to limit or restrict the scope of the disclosure as claimed in any way. The aspects, examples, and details provided in this application are considered sufficient to convey possession and enable others to make and use the best mode of the claimed disclosure. The claimed disclosure should not be construed as being limited to any aspect, example, or detail provided in this application. Regardless of whether shown and described in combination or separately, the various features (both structural and methodological) are intended to be selectively included or omitted to produce an embodiment with a particular set of features. Having been provided with the description and illustration of the present application, one skilled in the art may envision variations, modifications, and alternate aspects falling within the spirit of the broader aspects of the general inventive concept embodied in this application that do not depart from the broader scope of the claimed disclosure.

What is claimed is:
 1. A system comprising: at least one processor; and memory storing instructions that, when executed by the at least one processor, cause the system to perform a set of operations, the set of operations comprising: generating a reward predictor for historical data associated with a logging policy; determining an off-policy evaluation model, wherein the off-policy evaluation model comprises an estimator selected from the group consisting of a quality-agnostic estimator and a quality-based estimator; evaluating, using the off-policy evaluation model, a target policy to determine whether an expected reward metric of the target policy is higher than a reward metric of the logging policy; and when it is determined that the expected reward metric is higher than the reward metric of the logging policy, generating an indication to use the target policy instead of the logging policy.
 2. The system of claim 1, wherein determining the off-policy evaluation model comprises: generating, for the quality-agnostic estimator, a first mean squared error (MSE) metric; generating, for the quality-based estimator, a second MSE metric; when the first MSE is lower than the second MSE, selecting the quality-agnostic estimator as the estimator; and when the second MSE is lower than the first MSE, selecting the quality-based estimator as the estimator.
 3. The system of claim 1, wherein the off-policy evaluation model comprises a combination of direct modeling of a reward predictor and inverse propensity scoring, and wherein a weight of the reward predictor in the off-policy evaluation model is determined according to the estimator.
 4. The system of claim 1, wherein determining the off-policy evaluation model comprises determining a hyperparameter for the estimator.
 5. The system of claim 1, wherein the set of operations further comprises: receiving, from a user device, a second indication of a context; determining, according to the target policy, an action based on the received context; and providing, in response to the second indication, a third indication of the determined action.
 6. The system of claim 1, wherein the quality-agnostic estimator comprises a threshold at which an importance weight is clipped if the weight exceeds the threshold.
 7. The system of claim 1, wherein the set of operations further comprises: accessing the historical data from a historical data store, wherein the historical data comprises at least one context, an action associated with the context, and a reward for the action.
 8. A method for selecting a new policy based on a previous policy, the method comprising: accessing historical data associated with the previous policy, the historical data comprising at least one context, an action determined based on the context, and a reward for the action; evaluating, using an off-policy evaluation model, the new policy to determine whether to use the new policy instead of the previous policy, wherein the off-policy evaluation model comprises a combination of a direct model, inverse propensity scoring, and an estimator selected from the group consisting of a quality-agnostic estimator and a quality-based estimator; and based on determining that the new policy should be used instead of the previous policy: generating an action for a context according to the new policy; and providing an indication of the action to a user device.
 9. The method of claim 8, wherein the new policy is determined to be used instead of the previous policy when an average reward metric for the new policy is higher than an average reward metric for the previous policy, and wherein the average reward metric for the new policy is determined using the off-policy evaluation model.
 10. The method of claim 8, wherein the estimator of the off-policy evaluation model is selected by: generating, for the quality-agnostic estimator, a first mean squared error (MSE) metric; generating, for the quality-based estimator, a second MSE metric; when the first MSE is lower than the second MSE, selecting the quality-agnostic estimator as the estimator; and when the second MSE is lower than the first MSE, selecting the quality-based estimator as the estimator.
 11. The method of claim 10, further comprising: determining, for the selected estimator, a hyperparameter for the estimator, wherein the hyperparameter is determined by iteratively refining the hyperparameter to reduce the MSE of the selected estimator.
 12. The method of claim 8, wherein the quality-agnostic estimator comprises a threshold at which an importance weight is clipped if the weight exceeds the threshold.
 13. The method of claim 8, wherein the direct model is a regression model for the historical data, and wherein the inverse propensity scoring generates a weight for a predicted reward.
 14. A method for off-policy evaluation of a target policy, the method comprising: generating a reward predictor for historical data associated with a logging policy; determining an off-policy evaluation model, wherein the off-policy evaluation model comprises an estimator selected from the group consisting of a quality-agnostic estimator and a quality-based estimator; evaluating, using the off-policy evaluation model, the target policy to determine whether an expected reward metric of the target policy is higher than a reward metric of the logging policy; and when it is determined that the expected reward metric is higher than the reward metric of the logging policy, generating an indication to use the target policy instead of the logging policy.
 15. The method of claim 14, wherein determining the off-policy evaluation model comprises: generating, for the quality-agnostic estimator, a first mean squared error (MSE) metric; generating, for the quality-based estimator, a second MSE metric; when the first MSE is lower than the second MSE, selecting the quality-agnostic estimator as the estimator; and when the second MSE is lower than the first MSE, selecting the quality-based estimator as the estimator.
 16. The method of claim 14, wherein the off-policy evaluation model comprises a combination of direct modeling of a reward predictor and inverse propensity scoring, and wherein a weight of the reward predictor in the off-policy evaluation model is determined according to the estimator.
 17. The method of claim 14, wherein determining the off-policy evaluation model comprises generating a hyperparameter for the estimator.
 18. The method of claim 14, further comprising: receiving, from a user device, a second indication of a context; determining, according to the target policy, an action based on the received context; and providing, in response to the second indication, a third indication of the determined action.
 19. The method of claim 14, wherein the quality-agnostic estimator comprises a threshold at which an importance weight is clipped if the weight exceeds the threshold.
 20. The method of claim 14, further comprising: accessing the historical data from a historical data store, wherein the historical data comprises at least one context, an action associated with the context, and a reward for the action.